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Abstract 

Pooled short hairpin RNA sequencing (shRNA-seq) screens are beconning 
increasingly popular in functional genonnics research, and there is a need to 
establish optinnal analysis tools to handle such data. Our open-source shRNA 
processing pipeline in edgeR provides a connplete analysis solution for 
shRNA-seq screen data, that begins with the raw sequence reads and ends 
with a ranked lists of candidate shRNAs for downstreann biological validation. 
We first sunnnnarize the raw data contained in a fastq file into a nnatrix of counts 
(sannples in the colunnns, hairpins in the rows) with options for allowing 
nnisnnatches and snnall shifts in hairpin position. Diagnostic plots, nornnalization 
and differential representation analysis can then be perfornned using 
established nnethods to prioritize results in a statistically rigorous way, with the 
choice of either the classic exact testing nnethodology or a generalized linear 
nnodelling that can handle connplex experinnental designs. A detailed users' 
guide that dennonstrates how to analyze screen data in edgeR along with a 
point-and-click innplennentation of this workflow in Galaxy are also 
provided. The edgeR package is freely available fronn 
http://www.bioconductor.org. 
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Introduction 

Pooled short hairpin RNA sequencing (shRNA-seq) screens cou- 
ple RNA interference (RNAi) with second generation sequencing 
to enable researchers to elucidate gene function in an unbiased, 
high-throughput manner^ Several recent high impact studies have 
exploited this technology to discover novel genes involved in pro- 
cesses including cell fate decisions of normal and cancer cells, and 
to generate genetic interaction maps in mammalian cells^ "^. 

Pooled shRNA screening relies on the stable genomic integration 
(often by viral transduction) of expression cassettes that allow sta- 
ble or inducible expression of shRNAs targeting specific genes in a 
large population of cells. Depending on the biological question of 
interest, typically two or more cell populations are compared either 
in the presence or absence of a selective pressure, or as a time-course 
before and after a selective pressure is applied. Gain of shRNA repre- 
sentation within a pool suggests that target gene knockdown confers 
some sort of advantage to a cell. Similarly, genes whose knock- 
down is disadvantageous may be identified through loss of shRNA 
representation. Screening requires a library of shRNA constructs 
in a lentiviral or retroviral vector backbone that is used to generate 
a pool of virus for transducing cells of interest. The relative abun- 
dance of these shRNAs in transduced cells is then quantified by 
PGR amplification of proviral integrants from genomic DNA using 
primers designed to amplify all shRNA cassettes equally, followed 
by second-generation amplicon sequencing (Figure lA). Sample- 
specific primer indexing allows many different conditions to be 
analyzed in parallel. 

As the popularity of this approach grows, there is a need to develop 
suitable analysis pipelines to handle the large volumes of raw data 
that each screen generates. The major steps in an analysis involve 
processing the raw sequence reads, assessing the data quality and 
determining representational differences in the screen in a statisti- 
cally rigorous way. 

Two pipelines are currently available for this task. The shALIGN 
program^ is a custom Perl script that trims the sequence reads to 
the pre-defined base positions and then matches these to a library 
of hairpin sequences. Mismatch bases are permitted, and any 
ambiguous matches are ignored from the final hairpin count. Sta- 
tistical analysis of the data is then performed using the shRNAseq 
R package^ which calculates log-ratios of the counts from each 
screen replicate, normalizes these values and ranks hairpins by their 
median, mean or ^-statistic. Another solution is the BiNGS!SL-seq 
program^' that uses Bowtie to perform sequence mapping followed 
by statistical analysis in edgeR^. 

In this article, we describe a complete analysis solution for shRNA- 
seq screens accessible from within the edgeR package available 
from Bioconductorl 

Implementation 

A summary of the main steps in a typical shRNA-seq analysis 
alongside the functions in edgeR that perform each task is given 
in Figure IB. 



Sequence pre-processing 

Our sequence counting procedure has been tailored for screens where 
PGR amplified shRNA constructs of known structure are sequenced 
using second generation sequencing technology (Figure lA). The 
location of each index and hairpin sequence is used to determine 
matches between a specified list of index and hairpin sequences 
expected in the screen with the sequences in the fastq file. Mis- 
matches in the hairpin sequence are allowed to accommodate 
sequencing errors, as are small shifts in the position of the hairpin 
sequence within the read. Analysis of unpublished in-house data 
reveals that allowing for mismatches can yield up to 4.4% additional 
reads, and shifting an extra 2.6%. This simple searching strategy 
is implemented in G, with the user interface provided by the 
processHairpinReads function in edgeR. Input to this func- 
tion consists of a fastq file/s, a second file containing sample IDs 
and their index sequences and a third file listing hairpin IDs and 
their sequences (the latter files are tab-delimited). A screen with 
100 million reads (one lane from an Illumina HiSeq 2000) can be 
processed in 2-15 minutes depending on the processing parameters. 
Fastq processing requires minimal RAM, allowing analysis to be 
completed on any standard computer with R*^ installed. 

The matrix of counts returned by the processHairpinReads 
function, which contains hairpins in the rows and samples in the 
columns, is stored as a DGEList object so that it is fully interoper- 
able with the downstream analysis options available in edgeR. Such 
an object can also be created directly by the user in the event that 
the hairpin counts have been summarized by alternate means. 

Next, the data quality of a screen can be assessed conveniently using 
multidimensional scaling (MDS) plots via plotMDS (Figure IG) 
and access to a range of normalization options is available through 
the calcNormFactors function. 

Differential representation analysis 

The shRNAseq software^ assumes simple experimental set-ups 
(e.g. comparing two conditions) that are unsuitable in more com- 
plicated situations, such as time-course designs. In edgeR, screens 
can be analyzed using either the classic method"', ideal for simple 
two-group comparisons, or generalized linear models (GLMs)^^ for 
more complex screens with multiple conditions (using the glmFit 
function). This framework can accommodate hairpin- specific vari- 
ation of both a technical and biological nature as estimated via the 
estimateDisp function and visualized using plotBCV, which 
plots biological variability as a function of average hairpin abun- 
dance. Statistical testing for changes in shRNA abundance between 
conditions of interest (typically over time) is carried out using exact 
(see exactTest function) or likelihood ratio (glmLRT) tests that 
allow results to be ranked by significance using the topTags func- 
tion and plotted using the plotSmear function (Figure ID). 

Gene set analysis tools available via roast ^- and earner a^^ allow 
researchers to further test and prioritize screen results. This capa- 
bility can be used to obtain a gene -by-gene ranking, rather than a 
hairpin- specific one, which can be helpful when shRNA libraries 
contain multiple hairpins targeting each gene. 
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(A) Overview of shRNA sequencing 



(B) Summary of edgeR workflow 
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(D) Fold-change versus abundance 
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Figure 1. Summary of the raw data, workflow and diagnostic plots from edgeR. (A) Structure of the amplicons sequenced in a typical 
shRNA-seq screen. Each one contains sample and hairpin specific sequences at predetermined locations. After sequencing, the raw data 
is available in a fastq file. (B) The main steps and functions used in an analysis of shRNA-seq screen data in edgeR are shown. (C) Example 
of a multidimensional scaling (MDS) plot showing the relationships between replicate dimethyl sulfoxide (DMSO) and Nutlin treated samples 
(data from Sullivan et al. (201 2)^). MDS plots provide a quick display of overall variability in the screen and can highlight inconsistent samples. 
(D) Plot of log^-fold-change versus hairpin abundance (log^CPM) for the same data. Hairpins with a false discovery rate < 0.05 from an exact 
test analysis in edgeR (highlighted in red) may be prioritized for further validation. 



Case studies and further extensions 

We provide example data sets and a complete analysis script that 
demonstrate how to use the edgeR package to prioritize data from 
four different pooled shRNA-seq screens^^. These examples were 
chosen to showcase edgeR' s ability to deal with experiments of var- 
ying size (from tens to thousands of hairpins) and complexity, from 



two-group situations, to settings with four groups, or a time-course 
design, where a GLM with a slope and intercept term is most appro- 
priate. We have also developed a Galaxy tool^^"^^ that implements 
this workflow as a point-and-click application to improve accessi- 
bility for researchers who are unfamiliar with the R programming 
environment (Figure 2). 
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(A) 



shRNAseq Tool (version 1.0.5) 
Input File Type: 



[ 37; zuber_count_nature.txt 



Hairpin Annotation: 
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Sample Annotation: 



[ 39: zuber_samples_nature.txt 



Filter Low CPM?: 

Ignore hairpins with very low representation when performing analysis. 
Minimum CPM: 



Minimum Samples: 

[2 I 

Filter out all the genes that do not meet the minimum CPM In at least this many samples. 
Analysis Type: 

[ Generalised Linear Model i ] 

j Classic Exact Tests are useful for simple comparisons across two sampling groups. Generalised linear models allow 

G)ntrasts of Interest 

I Day 14 - Day 0 I 



Specify equations defining contrasts to be made. Eg. KD-Control will result in positive fold change If KD has greater 

Perform Gene Level Analysis?: 

fNo 



Analyse LogFC tendencies for hairpins belonging to the same gene. 
FDR Threshold: 
0.05 

All observations below this threshold will be highlighted in the smear plot. 
Absolute LogFC Threshold: 
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EdgeR Analysis Output: 
Input Summary: 

• NumbCT of Samples: 4 

• Number of Hairpins: 1105 
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(D) 



Barcodeplot for Brd4 (Day14 versus DayO) 



Figure 2. Screenshots of the Galaxy tool for shRNA-seq screen analysis powered by edgeR. (A) From the main screen, the user selects 
the appropriate input files and analysis options. (B) The results of an analysis are summarized in an HTML page that includes various 
diagnostic plots. (C) Output also includes a table of ranked results at the hairpin and gene-level (where appropriate) as well as barcode plots 
(D) that highlight the ranks of hairpins targeting a specific gene relative to all other hairpins in the data set. 



Discussion 

Although the major functionaUty of edgeR has been developed 
with RNA-seq data in mind, the analysis of numerous in-house 
data sets^"^ and the results of others' have demonstrated its utility 
for count data derived from shRNA-seq screens. edgeR provides 
users with a unique tool for the analysis of data from this emerg- 
ing application of second generation sequencing technology, that is 
capable of handling both the biological variability and experimental 
complexity inherent in these screens. Provision of a Galaxy module 
puts these powerful statistical methods within reach of experimen- 
talists. Future work will be focused on the use of a suitable control 
data set to compare this analysis pipeline with other approaches 
such as shRNAseq^. We anticipate that the approach for differential 
representation analysis described in this paper will also be useful 
in the analysis of short-guided RNA-seq (sgRNA-seq) knockout 
screens as facilitated by the clustered regularly interspaced short 
palindromic repeats-Cas9 (CRISPR-Cas9) system^^'^'^. 

Software availability 

edgeR is an R'^ package distributed as part of the Bioconductor 
project^ (http://www.bioconductor.org). The Galaxy tool that 
implements this workflow is available from http://toolshed.g2.bx. 
psu.edu/view/shians/shrnaseq. 
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